Extracting and Managing Structured Web Data
Authors
Abstract
Michael John Cafarella
Co-Chairs of the Supervisory Committee: Professor Oren Etzioni (Computer Science and Engineering) and Professor Dan Suciu (Computer Science and Engineering)

The Web contains a large amount of structured data embedded in natural language text, two-dimensional tables, and other forms. This “Structured Web” of data is vast, messy, and diverse; it also promises new and compelling applications. Unfortunately, existing tools such as search engines and relational databases ignore Structured Web data entirely.

This dissertation identifies four design criteria for a successful Structured Web management system. Such systems are:

1. Extraction-Focused: They obtain structured data wherever it can be found.
2. Domain-Independent: They are not tied to one particular topic area.
3. Domain-Scalable: They can effectively manage many domains simultaneously.
4. Computationally-Efficient: They can handle the Web’s enormous size.

We also describe three working Structured Web management systems that observe these criteria. TextRunner is an extractor for processing natural language Web text. WebTables extracts and provides applications on top of relations in HTML tables. Finally, Octopus provides integration services over extracted Structured Web data. Together, these three systems demonstrate that managing structured data on the Web is possible today, and also suggest directions for future systems.
Similar resources
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Linking Semistructured Data on the Web
Many Web data sources and APIs make their data available in XML, JSON, or a domain-specific semi-structured format, with the goal of making the data easily accessible and usable by Web application developers. Although such data formats are more machine-processable than pure text documents, managing and analyzing such data at large scale is often nontrivial. This is mainly due to the lack of a w...
A Framework for Extracting, Classifying, Analyzing, and Presenting Information from Semi-Structured Web Data Sources
Extracting information from web data sources has become very important because of the massive and increasing number of diverse semi-structured information sources available to users on the Internet, and the variety of web pages makes information extraction from the web a challenging problem. This paper proposes a framework for extracting, classifying, analyzing, and presentin...
Introduction to the Special Issue on Managing Information Extraction
The field of information extraction (IE) focuses on extracting structured data, such as person names and organizations, from unstructured text. This field has had a long history. It attracted steady attention in the 80s and 90s, largely in the AI community. In the past decade, however, spurred on by the explosion of unstructured data on the World-Wide Web, this attention has turned into a torre...
Extracting and Re-using Structured Data from Wikis
This report investigates simplifying the creation of structured data for use in Semantic Web applications. In the first phase of work, a prototype is created that extracts structured data on companies and unstructured data on acquisitions from Wikipedia. It then reuses this information in a data browser that can provide faceted, map and timeline views. In the second phase, we investigate more g...
Publication date: 2009